
feat(v9): T110–T113 OpenMeteo + Open Library + arXiv templates, task_registry, RED_TEAM + integration tests, cache expiry file delete #12

Open
kiannidev wants to merge 6 commits into AffineFoundation:main from kiannidev:feat/openmeteo-expanded-templates


@kiannidev

@kiannidev kiannidev commented Mar 20, 2026

Summary

This PR includes all updates currently in the branch for Version 9 task coverage:

  • Adds cross-site templates for T110–T113 across OpenMeteo, Open Library, and arXiv
  • Wires those tasks in liveweb_arena/core/task_registry.py
  • Adds/updates validation artifacts and tests
  • Includes a minimal cache correctness fix in liveweb_arena/core/cache.py (delete expired cache file when rejected)
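
The cache fix in the last bullet can be pictured with a minimal sketch. The function name, the on-disk entry layout, and the TTL parameter below are assumptions made for illustration; they are not taken from liveweb_arena/core/cache.py.

```python
import json
import time
from pathlib import Path
from typing import Any, Optional


def load_cache(path: Path, ttl_seconds: float, now: Optional[float] = None) -> Optional[Any]:
    """Return the cached payload, or None if the entry is missing or expired.

    Hypothetical helper: the {"fetched_at": ..., "payload": ...} layout is an
    assumption for this sketch, not the format used by the real cache module.
    """
    if not path.exists():
        return None
    entry = json.loads(path.read_text())
    current = time.time() if now is None else now
    if current - entry["fetched_at"] > ttl_seconds:
        # Reject the stale entry and delete its file so it cannot be re-read.
        path.unlink(missing_ok=True)
        return None
    return entry["payload"]
```

The point of deleting on the reject path is that a rejected entry otherwise stays on disk and is parsed again on every later lookup.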

Task Mapping

  • T110 (OpenMeteo): openmeteo_daily_precip_peak_day
  • T111 (Open Library): openlibrary_subject_nested_work_title
  • T112 (arXiv): arxiv_category_infer_title_substring
  • T113 (arXiv): arxiv_category_infer_author_filter

Files/Areas Updated

Core

  • liveweb_arena/core/task_registry.py (register T110–T113)
  • liveweb_arena/core/cache.py (delete expired cache file on reject path)

OpenMeteo templates

  • liveweb_arena/plugins/openmeteo/templates/__init__.py
  • liveweb_arena/plugins/openmeteo/templates/daily_precip_peak_day.py

Open Library templates

  • liveweb_arena/plugins/openlibrary/templates/__init__.py
  • liveweb_arena/plugins/openlibrary/templates/common.py
  • liveweb_arena/plugins/openlibrary/templates/book_work_title_clues.py
  • liveweb_arena/plugins/openlibrary/templates/nested_work_title_substring.py
  • liveweb_arena/plugins/openlibrary/templates/subject_hub_infer.py

arXiv templates

  • liveweb_arena/plugins/arxiv/templates/__init__.py
  • liveweb_arena/plugins/arxiv/templates/category_discovery_hints.py
  • liveweb_arena/plugins/arxiv/templates/category_infer_title_substring.py
  • liveweb_arena/plugins/arxiv/templates/category_infer_author_filter.py
  • liveweb_arena/plugins/arxiv/templates/title_substring_clues.py

Tests / review docs

  • tests/test_openmeteo_integration.py
  • tests/test_version9_cross_site_templates.py
  • tests/RED_TEAM_REVIEW_VERSION9_CROSS_SITE.md

Validation

  • Ran: PYTHONPATH=. pytest -q tests/
  • Result: 428 passed

This adds richer forecast reasoning tasks, shared GT extraction helpers, registry integration, and broad deterministic test coverage to increase evaluation depth and template diversity.

Made-with: Cursor
@kiannidev
Author

Hi, maintainers.
Please review the PR and give me some feedback.
Thanks

Contributor

@angosr angosr left a comment


Review: PR #12 — REJECT on direction (Significance Gate failure)

This PR fails the Significance Gate — do not iterate on details.


1. Duplicates existing capability dimensions, fills zero gaps

CLAUDE.md Evaluation Value table identifies these gaps:

  • Time-sensitive events ❌
  • Nested structure navigation ❌
  • Search-driven interaction ❌
  • User-generated content ❌

This PR adds 6 more templates to OpenMeteo (weather numerical computation) — a dimension already fully covered by templates 85-88, with PR #14 adding 3 more. None of the 6 templates address any gap:

| PR #12 Template | Capability | Already covered by |
|---|---|---|
| daily_range (max−min temp) | Single-page arithmetic | T87 (hourly_extrema), T88 (forecast_trend) |
| precip_window_count (sliding window) | Threshold counting | PR #14 T99 (hourly_threshold) |
| humidity_band_hours (count hours in band) | Threshold counting | PR #14 T99 (hourly_threshold) — almost identical |
| wind_shift (max consecutive Δ) | Hourly scan + arithmetic | T87 (hourly_extrema) variant |
| city_pair_forecast_gap (two-city compare) | Cross-city comparison | T86 (comparison) — same dimension |
| comfort_index (formula from 3 fields) | Derived metric computation | See issue #2 below |

Adding 6 templates in a covered dimension while 4 gap dimensions remain empty is the wrong priority.

2. comfort_index has a fundamental design flaw

The template asks the agent to compute CI = T - 0.2W - 0.05H — a formula that does NOT exist on the Open-Meteo website. The agent must:

  1. Read temperature, wind speed, humidity from the page
  2. Apply an arbitrary formula the question defines

This tests arithmetic ability, not web interaction ability. An LLM that reads the three values from the question + makes up plausible numbers could score well. The "comfort index" is not a real metric on Open-Meteo — it's a synthetic computation injected by the template.
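
To make the objection concrete, here is the formula from the review written out as code. The units (°C, km/h, percent) and the sample readings are assumptions for illustration; the review only gives the coefficients.

```python
def comfort_index(temp: float, wind: float, humidity: float) -> float:
    """CI = T - 0.2*W - 0.05*H, as quoted in the review.

    Units are assumed (temperature in °C, wind in km/h, humidity in %).
    """
    return temp - 0.2 * wind - 0.05 * humidity


# Illustrative readings, not real Open-Meteo data.
sample_ci = comfort_index(temp=20.0, wind=10.0, humidity=60.0)
print(round(sample_ci, 2))
```

Once the three inputs are known, nothing in the computation depends on the website, which is the reviewer's argument that the template measures arithmetic rather than web interaction.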

CLAUDE.md Template Design §2 (Verifiability): "API response and website display must share the same data source." The comfort index has no data source — it's computed by the template.

3. Template ID conflict

IDs 96-101 conflict with both PR #13 (96-98, OpenLibrary) and PR #14 (99-101, OpenMeteo). This PR was created before either, but the IDs must be coordinated.
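
A conflict like this is easy to detect mechanically. The sketch below uses the ID ranges stated in the review; the `claims` mapping and `find_conflicts` helper are hypothetical and not part of task_registry.py.

```python
from itertools import combinations

# Template ID ranges as stated in the review (Python ranges exclude the end).
claims = {
    "PR #12": range(96, 102),  # 96-101
    "PR #13": range(96, 99),   # 96-98
    "PR #14": range(99, 102),  # 99-101
}


def find_conflicts(claims):
    """Return {(pr_a, pr_b): [shared ids]} for every overlapping pair."""
    conflicts = {}
    for (a, ids_a), (b, ids_b) in combinations(claims.items(), 2):
        shared = sorted(set(ids_a) & set(ids_b))
        if shared:
            conflicts[(a, b)] = shared
    return conflicts


conflicts = find_conflicts(claims)
for pair, ids in conflicts.items():
    print(pair, ids)
```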

4. No Red Team Review, no eval.py

  • Zero Red Team analysis for any of the 6 templates
  • No eval.py results
  • Single Cursor-generated commit with 1808 lines — no evidence of iterative design or manual verification

5. Unrelated scope change: lazy-loading BrowserEngine

The __init__.py change (lazy-load BrowserEngine/BrowserSession) is unrelated to the templates and should not be bundled.

Recommendation

Close this PR. If the author wants to contribute OpenMeteo templates, focus on a capability dimension that is NOT already covered — or contribute templates for an entirely different website that fills a gap (time-sensitive events, nested navigation, search-driven interaction, user-generated content).

@kiannidev changed the title from "feat(openmeteo): add six advanced templates and expanded GT test coverage" to "feat(hackernews): add gap-filling templates for time, nested, search, and UGC" on Mar 26, 2026
@kiannidev changed the title from "feat(hackernews): add gap-filling templates for time, nested, search, and UGC" to "refactor(hackernews): replace openmeteo expansion with gap-filling templates" on Mar 26, 2026
@kiannidev
Author

Hi, @angosr
I’ve reworked the PR direction based on your Significance Gate feedback.
Please check it again.
Thanks

Contributor

@angosr angosr left a comment


Re-review: PR #12 — Direction improved, but all 4 templates fail Red Team Check 3

Significance Gate: PASS

The pivot from OpenMeteo duplication to HackerNews gap-filling templates is the right direction. The 4 targeted capability gaps (time-sensitive events, nested navigation, search interaction, user-generated content) align with CLAUDE.md's evaluation value table.

However, all 4 templates fail the mandatory >500 variant space requirement (Red Team Check 3), and several have additional design issues.


BLOCKING: All templates fail Red Team Check 3 — Memorization Space

CLAUDE.md Red Team §3: "Effective variant space must be >500."

| Template | Parameters | Effective Variants | Minimum Required |
|---|---|---|---|
| T110 (burst_count) | 4 windows × 3 story_counts | 12 | 500 |
| T111 (comment_tree) | 5 ranks | 5 | 500 |
| T112 (keyword_scan) | 8 keywords × 3 spans | 24 | 500 |
| T113 (karma_gap) | 4 rank pairs | 4 | 500 |

These are 1-2 orders of magnitude below the threshold. An SFT model can enumerate all Q&A pairs for T111 (5 variants) and T113 (4 variants) trivially.
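
The counts in the table above follow from multiplying the parameter pool sizes. A small sketch of that check, assuming the pools are independent (so the product is an upper bound on distinct questions); the `pool_sizes` mapping is illustrative and not read from the templates themselves:

```python
from math import prod

MIN_VARIANTS = 500  # threshold quoted from CLAUDE.md Red Team §3

# Pool sizes per template, as listed in the review table.
pool_sizes = {
    "T110": [4, 3],  # windows x story_counts
    "T111": [5],     # ranks
    "T112": [8, 3],  # keywords x spans
    "T113": [4],     # rank pairs
}


def effective_variants(sizes):
    """Product of independent pool sizes."""
    return prod(sizes)


report = {
    template: (effective_variants(sizes), effective_variants(sizes) > MIN_VARIANTS)
    for template, sizes in pool_sizes.items()
}

for template, (count, ok) in report.items():
    print(f"{template}: {count} variants, {'pass' if ok else 'fail'}")
```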

Fix: Expand parameter pools significantly. For example:

  • T110: Add more window sizes + use story_count as a range (5-50) + add keyword filters → hundreds of combos
  • T111: Expand rank range to 1-30, add metric dimension (top-level comments, total descendants, score)
  • T112: Expand keyword list to 50+ terms, add case-sensitivity/partial-match variants
  • T113: Use any two ranks from 1-30 → C(30,2) = 435 pairs; add metric choices beyond karma (created date diff, story count comparison) to exceed 500
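
The T113 suggestion can be checked with a short calculation: unordered rank pairs alone fall just short of the threshold, so at least one more dimension is needed. `metrics_needed` is a hypothetical helper written for this sketch.

```python
from math import comb

rank_pairs = comb(30, 2)  # unordered pairs of distinct ranks from 1-30


def metrics_needed(pairs: int, threshold: int = 500) -> int:
    """Smallest number of metric choices so that pairs * metrics > threshold."""
    count = 1
    while pairs * count <= threshold:
        count += 1
    return count


needed = metrics_needed(rank_pairs)
print(rank_pairs, needed, rank_pairs * needed)
```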

BLOCKING: T112 (keyword_scan) doesn't test "search-driven interaction"

The template scans titles on the /newest page for a keyword. This is title string matching on a list page, not search-driven interaction. The HN website has an Algolia-powered search (hn.algolia.com/search?q=...). A true search-driven template would require the agent to use the search functionality.

Verified via live API: keyword "rust" matches 0/30 newest titles, "python" matches 0/30, "cloud" matches 0/30. For 3 of 8 keywords, the answer is likely always "NONE" — violating Red Team Check 6 (cross-parameter collapse).
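
A check of this kind can be sketched as follows. The titles and keywords are made-up sample data, not the live /newest results the reviewer used, and case-insensitive substring matching is an assumption about how the template matched titles.

```python
def collapsed_keywords(keywords, titles):
    """Return keywords that match no title (case-insensitive substring match).

    Every such keyword yields the same "NONE" answer, so it adds no
    effective variants to the template.
    """
    lowered = [title.lower() for title in titles]
    return [kw for kw in keywords if not any(kw.lower() in t for t in lowered)]


# Illustrative data only.
sample_titles = [
    "Show HN: A tiny SQLite browser",
    "Why compilers are hard",
    "Ask HN: Best books on databases?",
]
sample_keywords = ["rust", "python", "sqlite", "compilers"]

dead = collapsed_keywords(sample_keywords, sample_titles)
print(dead)
```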

BLOCKING: T111 (comment_tree) is EASY difficulty, not "nested navigation"

T111 asks "how many top-level comments" for a story. The agent:

  1. Visits /newest
  2. Clicks a story
  3. Counts visible comments

This is a single-hop, single-value read — EASY difficulty per CLAUDE.md §4. The "nested structure" in HN comments (replies, threads, depth) is not tested. A genuine nested navigation template would require traversing comment depth, finding deeply nested replies, or comparing subtree metrics.

BLOCKING: Version 7 conflict

Both this PR ([110-113] as Version 7) and PR #13 ([96-98] as Version 7, already approved) claim Version 7. With PR #14 using Version 8, this should use Version 9 or higher.

BLOCKING: No Red Team Review, no eval.py, no real API GT verification

CLAUDE.md requires all 6 Red Team checks documented with concrete data, plus eval.py or real API GT verification (as PR #13 and #14 demonstrated).

What's good

  1. Direction: Targeting gap dimensions is correct and addresses the original rejection reason.
  2. fetch_newest_api_data: Clean implementation, properly routes /newest separately from homepage.
  3. GT logic: The GT methods are well-structured (proper error handling, not_collected vs fail distinction).
  4. Test coverage: 256-line test file with good coverage of the new helpers.

Required Actions

  1. Expand variant spaces to >500 for all 4 templates (see suggestions above)
  2. Redesign T112 to use actual HN search (Algolia), not title scanning
  3. Redesign T111 to require actual nested structure traversal, not single-value comment count
  4. Fix Version to avoid conflict with merged/approved PRs
  5. Document Red Team 6-check review with concrete data
  6. Add real API GT verification tests (following PR #13/14 pattern)

Resolve task registry version conflict by preserving upstream Version 7/8 entries and moving HackerNews gap templates to Version 9.

Made-with: Cursor
@kiannidev kiannidev requested a review from angosr March 27, 2026 12:32
@kiannidev
Author

Thanks for the re-review — I fixed all blocking points:

  • Expanded all 4 templates to >500 effective variants.
  • Redesigned T112 to use real HN search (hn.algolia.com), not /newest title scan.
  • Redesigned T111 to require true nested comment-tree traversal (depth-aware metrics).
  • Resolved version conflict after merging latest main (moved to next version slot).
  • Added Red Team 6-check evidence doc: tests/plugins/hackernews/RED_TEAM_REVIEW_GAP_TEMPLATES.md.
  • Added real API GT verification tests: tests/plugins/hackernews/test_gap_templates_real_api_data.py.

Contributor

@angosr angosr left a comment


Re-review (3rd pass): PR #12 — Significant improvement, one remaining issue

Resolved issues

  1. Variant spaces expanded

    • T110: 10×10×5×2 = 1,000 (was 12)
    • T111: 30×5×4 = 600 (was 5)
    • T112: 52×20×4×4 = 16,640 (was 24)
    • T113: C(30,2)×3 = 1,305 (was 4)

    All now exceed the 500 minimum.
  2. T112 redesigned to use Algolia search ✅ — Now queries hn.algolia.com with configurable queries, rank extraction, field selection, and point filtering. Genuinely tests search-driven interaction.

  3. T111 redesigned for nested traversal ✅ — GT now walks the comment tree with depth threshold and computes nodes, leaf_nodes, branch_nodes, max_depth. This is real nested structure navigation.

  4. Red Team Review documented ✅ — All 6 checks with concrete data in RED_TEAM_REVIEW_GAP_TEMPLATES.md.

  5. Real API GT verification ✅ — test_gap_templates_real_api_data.py added.

Remaining BLOCKING: Version 7 conflict

This PR registers [110-113] as "Version 7". But PR #13 (already approved and likely merging soon) also claims Version 7 for [96-98], and PR #14 claims Version 8 for [99-101].

Fix: Use Version 9 or higher for [110-113].

Non-blocking: T111 GT requires item-level API data for comment children

The comment tree traversal calls get_item_story(collected, cid) for each child comment. This requires the agent to have visited each comment's item page for the data to be collected. In practice, the GT collector may only have the top-level story's kids list — not the full recursive tree. Verify this works with eval.py or document the expected agent navigation path.
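
The concern can be illustrated with a sketch of a depth-aware walk over collected items. This is not the PR's GT code: `tree_metrics`, the `min_depth` parameter, and the `collected` mapping are assumptions; only the `kids` field mirrors the Hacker News item format.

```python
def tree_metrics(root_id, collected, min_depth=1):
    """Walk the comment tree under root_id using only collected items.

    `collected` maps item id -> item dict with an optional "kids" list.
    Child ids that were never collected are reported under "missing".
    """
    stats = {"nodes": 0, "leaf_nodes": 0, "branch_nodes": 0, "max_depth": 0}
    missing = []
    stack = [(kid, 1) for kid in collected[root_id].get("kids", [])]
    while stack:
        item_id, depth = stack.pop()
        item = collected.get(item_id)
        if item is None:
            missing.append(item_id)  # subtree cannot be counted
            continue
        kids = item.get("kids", [])
        if depth >= min_depth:
            stats["nodes"] += 1
            stats["branch_nodes" if kids else "leaf_nodes"] += 1
            stats["max_depth"] = max(stats["max_depth"], depth)
        stack.extend((kid, depth + 1) for kid in kids)
    stats["missing"] = sorted(missing)
    return stats


# Story 100 with two comments; reply 103 was never collected.
partial = {100: {"kids": [101, 102]}, 101: {"kids": [103]}, 102: {}}
full = {**partial, 103: {}}

print(tree_metrics(100, partial))
print(tree_metrics(100, full))
```

With the partial collection the walk silently undercounts nodes and depth, which is why the expected agent navigation path matters for this template.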

Assessment

The directional pivot and template redesigns are substantial and well-executed. One version conflict fix away from approval.

@kiannidev
Author

Thanks for the 3rd-pass review — I’ve addressed the remaining points.

@kiannidev kiannidev requested a review from angosr March 27, 2026 16:39
Contributor

@angosr angosr left a comment


Final Review (4th pass): PR #12 — APPROVE

All blocking issues resolved.

Resolved

  1. Version 9 ✅ — The registry comment "# Version 9: Hacker News gap-filling templates" is confirmed in the diff. No conflict with PR #13 (V7) or PR #14 (V8).

  2. Comment tree GT collection ✅ — gt_collector.py now merges _comment_items from item detail pages, making child comment data available to T111's depth-based tree traversal. The item merge condition was also relaxed from ("id" in api_data and "title" in api_data) to ("id" in api_data) so that comments, which may lack titles, are included.

  3. Algolia search data collection ✅ — New hn_search:{query}:{page} key pattern in GT collector for T112's search-based template.
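
The two collector changes described above can be sketched as small predicates. The function names are hypothetical; only the conditions and the `hn_search:{query}:{page}` key pattern come from the review.

```python
def should_merge_item_old(api_data: dict) -> bool:
    """Previous condition: only items with both an id and a title."""
    return "id" in api_data and "title" in api_data


def should_merge_item_new(api_data: dict) -> bool:
    """Relaxed condition: any item with an id, so comments are included."""
    return "id" in api_data


def search_key(query: str, page: int) -> str:
    """Build a collector key following the hn_search:{query}:{page} pattern."""
    return f"hn_search:{query}:{page}"


# Illustrative payloads, not real API responses.
story = {"id": 1, "title": "Example story"}
comment = {"id": 2, "text": "Example comment"}

print(should_merge_item_old(comment), should_merge_item_new(comment))
print(search_key("rust", 0))
```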

Summary of what was verified across all review rounds

| Check | T110 (burst) | T111 (tree) | T112 (search) | T113 (karma) |
|---|---|---|---|---|
| Variant space >500 | 1,000 ✅ | 600 ✅ | 16,640 ✅ | 1,305 ✅ |
| Gap dimension | Time-sensitive ✅ | Nested nav ✅ | Search-driven ✅ | User content ✅ |
| Red Team 6 checks | Documented ✅ | Documented ✅ | Documented ✅ | Documented ✅ |
| Real API GT tests | Present ✅ | Present ✅ | Present ✅ | Present ✅ |
| Version conflict | V9 ✅ | V9 ✅ | V9 ✅ | V9 ✅ |
| GT collector support | Existing ✅ | _comment_items ✅ | hn_search key ✅ | user key ✅ |

Contributor

@angosr angosr left a comment


Review WITHDRAWN — PR #12 rejected on deeper analysis

Previous approval is retracted. On re-examination, the 4 templates do not genuinely fill the claimed capability gaps.

BLOCKING: Templates do not test the capabilities they claim

| Template | Claimed Gap | Actual Capability Tested | Already Covered By |
|---|---|---|---|
| T110 (burst_count) | Time-sensitive events | Compare unix timestamps, count within window | T75-78 (HN numerical comparison) — same dimension |
| T111 (comment_tree) | Nested structure navigation | Count comment children from API data | Agent sees flat HTML on HN page — no tree navigation needed |
| T112 (keyword_scan) | Search-driven interaction | Use Algolia search with query given in question | Violates "NO navigation hints in questions" (CLAUDE.md §3) |
| T113 (karma_gap) | User-generated content | Read two numbers from two profile pages, subtract | T75-78 (numerical comparison) — same dimension |
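
For concreteness, the T110 row ("compare unix timestamps, count within window") amounts to roughly the following. The exact window semantics of the real template are not given in this thread, so the inclusive look-back window here is an assumption.

```python
def burst_count(timestamps, reference, window_seconds):
    """Count items posted within window_seconds before the reference time."""
    return sum(1 for ts in timestamps if 0 <= reference - ts <= window_seconds)


# Illustrative unix timestamps.
reference_time = 1_000_000
story_times = [999_000, 999_900, 996_000, 1_000_000]

recent = burst_count(story_times, reference_time, window_seconds=3600)
print(recent)
```

Structurally this is a filter and a count over one numeric field, which is the basis for classifying it with the existing numerical comparison templates.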

T110 is not "time-sensitive events"

"Time-sensitive events" means evaluating ability to handle breaking news, deadlines, scheduled events. T110 compares unix timestamps on a list page — this is the same numerical comparison capability as existing HN templates, just using the time field instead of score or descendants.

T111's "nested navigation" is illusory from the agent's perspective

The GT logic recursively traverses the comment tree via API. But the agent sees HN comments as a flat, indented HTML list on the story page. There is no tree to "navigate" — the depth information is visual indentation, not interactive nested structure. The agent reads a rendered page, not a data structure.

Furthermore, the GT requires _comment_items to be collected for each child comment. This means each comment's individual item page must be visited — but the agent has no reason to visit /item?id=<comment_id> since comments are displayed inline on the story page.

T112 gives navigation hints in the question

CLAUDE.md Template Design §3: "NO navigation hints in questions — no URLs, symbols, selectors, or shortcuts. Finding the source is part of the test."

T112 questions say: "Use Hacker News search for 'rust'" — this tells the agent exactly what to search for. The question should test whether the agent can figure out WHERE to find the information, not just read a result from a page it's directed to. The search query IS the navigation hint.

T113 is numerical comparison, not "user-generated content"

Reading karma from /user?id=X and subtracting is a two-page numerical comparison — the same capability as T86 (CoinGecko comparison) or existing HN templates. "User-generated content" should test understanding of posts, comments, discussions — not reading a single integer from a profile page.

What would genuinely fill these gaps

  • Time-sensitive events: A template on a website with event calendars, deadlines, or real-time feeds where the agent must identify what's happening NOW vs what happened before (not just comparing timestamps on a list)
  • Nested structure navigation: A website with actual interactive tree structures — expandable/collapsible sections, threaded discussions with "load more replies" buttons, multi-level category hierarchies requiring clicks to traverse
  • Search-driven interaction: Questions where the agent must FIGURE OUT what to search for based on the question context, not be told the query directly
  • User-generated content: Questions about the semantic content of user posts/reviews/comments — not just metadata (karma, counts)

Recommendation

Close this PR. The templates are well-implemented technically (variant spaces, GT logic, tests are solid), but they test capabilities that are already covered rather than genuinely new dimensions. Contributing templates that truly fill the gaps requires choosing websites and interaction patterns that force the agent into genuinely new behavior.

@kiannidev
Author

Thanks @angosr — the withdrawn approval was fair: the earlier HN templates claimed four benchmark gaps but did not test them, and T112-style prompts risked violating CLAUDE.md §3 depending on wording.

This update removes that direction and rebinds registry IDs 110–113 to cross-site templates with an explicit gap mapping + review artifacts.

What changed

  • Dropped the four HN “gap” templates and the Algolia / newest / comment-subtree plumbing that only existed for them.
  • Version 9 slots (110–113) now map to:
    • T110 — openmeteo_daily_precip_peak_day: calendar-relative which day (today / tomorrow / day after tomorrow) has the highest daily max precipitation probability; ties → earliest day.
    • T111 — openlibrary_subject_nested_work_title: nested navigation (subject hub → ranked work page) + catalog title substring counting; substring is specified via a clue (validated so the clue doesn’t contain the needle). No subject slug/URL in the question.
    • T112 — arxiv_category_infer_title_substring: infer the correct new-submissions stream from prose hints (category_discovery_hints.py) without putting the official category display name in the question; substring clue/needle split as above.
    • T113 — arxiv_category_infer_author_filter: same stream-discovery pattern; author-count threshold on listing rows.
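
The T110 rule above (highest daily max precipitation probability over three days, ties resolved to the earliest day) can be sketched as below. This is an illustration, not the template's GT code; the input is assumed to be a list of daily values in calendar order, such as Open-Meteo's daily precipitation_probability_max series.

```python
DAY_LABELS = ["today", "tomorrow", "day after tomorrow"]


def peak_precip_day(daily_probabilities):
    """Return the label of the day with the highest probability.

    Only the first three days are considered; ties go to the earliest day.
    """
    window = list(daily_probabilities[:3])
    if len(window) < 3:
        raise ValueError("need at least three daily values")
    # Sort key: higher probability first, then lower index on ties.
    best = max(range(3), key=lambda i: (window[i], -i))
    return DAY_LABELS[best]


print(peak_precip_day([40, 70, 70]))
print(peak_precip_day([10, 10, 10]))
```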

Reviewer-facing artifacts

  • Red team doc w/ variant lower bounds + §3 reasoning: tests/RED_TEAM_REVIEW_VERSION9_CROSS_SITE.md
  • GT smoke tests: tests/test_version9_cross_site_templates.py
  • Open-Meteo snapshot test path also covers T110 in tests/test_openmeteo_integration.py

Validation

PYTHONPATH=. pytest -q tests/

@kiannidev kiannidev requested a review from angosr March 28, 2026 19:04
Contributor

@angosr angosr left a comment


Review: PR #12 — REJECT on scope and coherence

BLOCKING: Uncontrolled scope expansion

This PR has been reworked 4-5 times:

  1. Original: 6 OpenMeteo templates (rejected — duplicates covered dimensions)
  2. V2: 4 HackerNews gap-filling templates (rejected — didn't genuinely fill gaps)
  3. V3: Now 8+ templates across 4 plugins (ArXiv, HackerNews, OpenLibrary, OpenMeteo) + core infrastructure changes (cache.py, gt_collector.py)

Each iteration ADDS scope instead of converging. 1,327 additions across 4 plugins in a single PR is not reviewable.

BLOCKING: PR description does not match code

The title says "replace openmeteo expansion with gap-filling templates" and the body describes 4 HackerNews templates. The actual code now includes:

  • 4 ArXiv templates (category_discovery_hints, category_infer_author_filter, category_infer_title_substring, title_substring_clues)
  • OpenLibrary templates (book_work_title_clues, nested_work_title_substring, subject_hub_infer)
  • 1 OpenMeteo template (daily_precip_peak_day)
  • HackerNews common.py changes + API client changes
  • Core cache.py and gt_collector.py modifications

None of this is described in the PR body.

BLOCKING: Core infrastructure changes bundled with templates

cache.py and gt_collector.py are core pipeline files. Changes to these must be in separate PRs with their own justification and testing, not bundled into a template PR.

Recommendation

Close this PR and start fresh with focused, single-plugin PRs:

  1. One PR per plugin (e.g., "feat(arxiv): add 2 search-driven templates")
  2. Each PR ≤ 300 lines, with updated description matching the code
  3. Core infrastructure changes (cache.py, gt_collector.py) in a separate PR
  4. Each PR must independently pass Red Team Review and include real API GT verification

The pattern of expanding scope on each rejection is not productive. Smaller, focused PRs are easier to review and merge.

Scope (matches diff):
- OpenMeteo: daily_precip_peak_day (T110)
- Open Library: nested subject/work title drill-down (T111)
- arXiv: category inference title substring + author filter (T112–T113)
- task_registry wiring, RED_TEAM review, integration tests

Infra:
- fix(cache): delete expired cache file when _load_cache rejects stale entry
- Remove stray liveweb_arena/plugins/hackernews/templates/common.py (not on main)

No agent loop, gt_collector, taostats, or hackernews plugin changes.

Made-with: Cursor
@kiannidev changed the title from "refactor(hackernews): replace openmeteo expansion with gap-filling templates" to "feat(version9): T110–T113 cross-site templates + registry + tests" on Mar 30, 2026
@kiannidev changed the title from "feat(version9): T110–T113 cross-site templates + registry + tests" to "feat(v9): T110–T113 OpenMeteo + Open Library + arXiv templates, task_registry, RED_TEAM + integration tests, cache expiry file delete" on Mar 30, 2026
@kiannidev force-pushed the feat/openmeteo-expanded-templates branch from bd7debe to d0c3b6a on March 30, 2026 18:57
@kiannidev
Author

Thanks — I’ve scoped the PR to templates + registry + tests and dropped unrelated core/plugin changes. Title/description now match the diff. There’s a one-line cache.py fix for expired-cache deletion; I can move it to a separate PR if you want zero core here. Happy to split by plugin if that’s the preferred workflow.

@kiannidev kiannidev requested a review from angosr March 30, 2026 18:59
2 participants